# Cross-modal Alignment

Vit So400m Patch16 Siglip 256.webli I18n
Apache-2.0
A Vision Transformer (ViT) model based on SigLIP, focused on image feature extraction and retaining the original attention pooling mechanism.
Image Classification, Transformers
timm
15
0
Vit Large Patch14 Clip 224.datacompxl
Apache-2.0
A Vision Transformer (ViT) model based on the CLIP architecture, designed for image feature extraction and released by LAION.
Image Classification, Transformers
timm
14
0
Mblip Bloomz 7b
MIT
mBLIP is a multilingual vision-language model based on the BLIP-2 architecture, supporting image caption generation and visual question answering tasks in 96 languages.
Image-to-Text, Transformers, Supports Multiple Languages
Gregor
21
1
Mblip Mt0 Xl
MIT
mBLIP is a multilingual vision-language model based on the BLIP-2 architecture, supporting image caption generation and visual question answering tasks in 96 languages.
Image-to-Text, Transformers, Supports Multiple Languages
Gregor
374
14